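Sentiment analysis is the most basic NLP task, used to determine the polarity of text data. There has also been a significant amount of work on multilingual text. Still, hate and offensive speech detection remains challenging because of the inadequate availability of data, especially for Indian languages such as Hindi and Marathi. In this work, we consider hate and offensive speech detection in Hindi and Marathi texts. The problem is formulated as a text classification task using state-of-the-art deep learning approaches. We explore different deep learning architectures such as CNN and LSTM, as well as variations of BERT such as multilingual BERT, IndicBERT, and monolingual RoBERTa. The basic models based on CNN and LSTM are augmented with FastText word embeddings. We use the HASOC 2021 Hindi and Marathi hate speech datasets to compare these algorithms. The Marathi dataset consists of binary labels, and the Hindi dataset consists of binary as well as more fine-grained labels. We show that the transformer-based models perform the best, and that even the basic models with FastText embeddings give competitive performance. Moreover, with normal hyperparameter tuning, the basic models perform better than the BERT-based models on the fine-grained Hindi dataset.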
translated by 谷歌翻译
Monitoring water is a complex task due to its dynamic nature, added pollutants, and land build-up. The availability of high-resolution data from Sentinel-2 multispectral products makes implementing remote sensing applications feasible. However, overutilizing or underutilizing the product's multispectral bands can lead to inferior performance. In this work, we compare the performance of ten of the thirteen bands available in a Sentinel-2 product for water segmentation using eight machine learning algorithms. We find that the shortwave infrared bands (B11 and B12) are the best for segmenting water bodies. B11 achieves an overall accuracy of $71\%$ while B12 achieves $69\%$ across all algorithms on the test site. We also find that the Support Vector Machine (SVM) algorithm is the most favourable for single-band water segmentation. The SVM achieves an overall accuracy of $69\%$ across the tested bands over the given test site. Finally, to demonstrate the effectiveness of choosing the right amount of data, we use only B11 reflectance data to train an artificial neural network, BandNet. Even with a basic architecture, BandNet is comparable to known architectures for semantic and water segmentation, achieving a $92.47$ mIOU on the test site. BandNet requires only a fraction of the time and resources to train and run inference, making it suitable for deployment in web applications that monitor water bodies in localized regions. Our codebase is available at https://github.com/IamShubhamGupto/BandNet.
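For readers unfamiliar with the mIOU figure quoted above, the metric can be sketched as follows. This is a minimal illustration only: the function name `mean_iou`, the flattened-mask representation, and the choice to skip classes absent from both masks are assumptions made here, not details taken from the BandNet codebase.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int = 2) -> float:
    """Mean intersection-over-union across classes present in either mask.

    Hypothetical helper for illustration; masks are flattened lists of class ids.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    ious = []
    for cls in range(num_classes):
        # Pixels where both masks agree on this class.
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        # Pixels where at least one mask assigns this class.
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes that appear in neither mask
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class present in either mask")
    return sum(ious) / len(ious)


# Toy flattened masks with made-up values: 1 = water, 0 = background.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 0, 1, 1]
score = mean_iou(predicted, reference)
```

The abstract reports mIOU on a 0 to 100 scale, so a value returned by this sketch would be multiplied by 100 for comparison.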
In this paper, we discuss an imitation-learning-based method for reducing the calibration error of a mixed reality system consisting of a vision sensor and a projector. Unlike a head-mounted display, this setup makes augmented information available to a human subject by projecting a scene into the real world. Inherently, the camera and projector need to be calibrated as a stereo setup to project accurate information in 3D space. Previous calibration processes require multiple recording and parameter-tuning steps to achieve the desired calibration, which is usually a time-consuming process. To avoid such tedious calibration, we train a CNN model to iteratively correct the extrinsic offset given a QR code and a projected pattern. We discuss the overall system setup, data collection for training, and results of the auto-correction model.
Language-conditioned policies allow robots to interpret and execute human instructions. Learning such policies requires a substantial investment of time and compute resources. Still, the resulting controllers are highly device-specific and cannot easily be transferred to a robot with different morphology, capability, appearance, or dynamics. In this paper, we propose a sample-efficient approach for training language-conditioned manipulation policies that allows for rapid transfer across different types of robots. By introducing a novel method, namely Hierarchical Modularity, and adopting supervised attention across multiple sub-modules, we bridge the divide between modular and end-to-end learning and enable the reuse of functional building blocks. In both simulated and real-world robot manipulation experiments, we demonstrate that our method outperforms the current state-of-the-art methods and can transfer policies across 4 different robots in a sample-efficient manner. Finally, we show that the functionality of learned sub-modules is maintained beyond the training process and can be used to introspect the robot decision-making process. Code is available at https://github.com/ir-lab/ModAttn.
We propose SparseFusion, a sparse view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation. Existing approaches typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes. Alternate methods treat this as a (probabilistic) 2D synthesis task, and while they can generate plausible 2D images, they do not infer a consistent underlying 3D. However, we find that this trade-off between 3D consistency and probabilistic image generation does not need to exist. In fact, we show that geometric consistency and generative inference can be complementary in a mode-seeking behavior. By distilling a 3D consistent scene representation from a view-conditioned latent diffusion model, we are able to recover a plausible 3D representation whose renderings are both accurate and realistic. We evaluate our approach across 51 categories in the CO3D dataset and show that it outperforms existing methods, in both distortion and perception metrics, for sparse-view novel view synthesis.
People who are hearing impaired face many obstacles in communication and often require an interpreter to comprehend what a person is saying. Despite constant scientific research, existing models lack the ability to make accurate predictions. We therefore propose a deep learning model trained on American Sign Language (ASL) that takes ASL gestures as input and translates them into text. To achieve the translation, a Convolutional Neural Network (CNN) model and a transfer learning model based on the VGG16 architecture are used. Accuracy improves from 94% with the CNN to 98.7% with transfer learning, a gain of 4.7 percentage points (about 5% in relative terms). An application with the deep learning model integrated has also been built.
Recent video+language datasets cover domains where the interaction is highly structured, such as instructional videos, or where the interaction is scripted, such as TV shows. Both of these properties can lead models to exploit spurious cues rather than learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding. We also provide state-of-the-art baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval, and play-by-play live commentary generation. Results show that SOTA models perform reasonably well in most tasks. We discuss the implications of these results and suggest new tasks for which GOAL can be used. Our codebase is available at: https://gitlab.com/grounded-sport-convai/goal-baselines.
Industries must follow government rules and regulations around the world to classify products when assessing duties and taxes for international shipment. The Harmonized System (HS) is the most standardized numerical method of classifying traded products among industry classification systems. A hierarchical ensemble model comprising a BERT transformer, NER, distance-based approaches, and knowledge graphs has been developed to address scalability, coverage, the ability to capture nuances, automation, and auditing requirements when classifying unknown text descriptions according to the HS method.
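We present an empirical study of adapting an existing pretrained text-to-text model for long-sequence inputs. Through a comprehensive study along three axes of the pretraining pipeline (model architecture, optimization objective, and pretraining corpus), we propose an effective recipe for building long-context models from existing short-context models. Specifically, we replace the full attention in transformers with pooling-augmented blockwise attention, and pretrain the model with a masked-span prediction task with spans of varying length. In terms of the pretraining corpus, we find that using randomly concatenated short documents from a large open-domain corpus results in better performance than using existing long-document corpora, which are typically limited in their domain coverage. With these findings, we build a long-context model that achieves competitive performance on long-text QA tasks and establishes a new state of the art on five long-text summarization datasets, often outperforming previous methods with larger model sizes.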
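The idea of building long training inputs by randomly concatenating short documents can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the function name `concat_short_docs`, the whitespace-level tokens, the fixed `target_len` chunking, and the absence of separator tokens are all choices made here for brevity.

```python
import random
from typing import List


def concat_short_docs(docs: List[List[str]], target_len: int, seed: int = 0) -> List[List[str]]:
    """Shuffle short tokenized documents and pack them into long sequences.

    Hypothetical helper: each returned sequence holds at most target_len tokens.
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    order = list(docs)         # copy so the caller's list is not reordered
    rng.shuffle(order)
    # Flatten the shuffled documents into one token stream.
    stream = [tok for doc in order for tok in doc]
    # Cut the stream into consecutive chunks; the last chunk may be shorter.
    return [stream[i:i + target_len] for i in range(0, len(stream), target_len)]


# Made-up toy corpus of three short "documents".
short_docs = [["a", "b"], ["c"], ["d", "e", "f"]]
long_sequences = concat_short_docs(short_docs, target_len=4)
```

A real pipeline would likely operate on subword ids and might mark document boundaries; those details are not stated in the abstract and are left out here.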
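When humans grasp objects in the real world, we often move our arms to hold the object in a different pose where we can use it. In contrast, typical lab settings only study the stability of the grasp immediately after lifting, without any subsequent repositioning of the arm. However, grasp stability can vary widely with the object's holding pose, because the gravitational torque and gripper contact forces may change completely. To facilitate the study of how holding poses affect grasp stability, we present PoseIt, a novel multi-modal dataset that contains visual and tactile data collected from a full cycle of grasping an object, repositioning the arm to one of the sampled poses, and shaking the object. Using data from PoseIt, we can formulate and tackle the task of predicting whether a grasped object is stable in a particular held pose. We train an LSTM classifier that achieves 85% accuracy on the proposed task. Our experimental results show that multi-modal models trained on PoseIt achieve higher accuracy than models using only vision or only tactile data, and that our classifiers also generalize to unseen objects and poses.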